Red Hat Enterprise Linux 7 Troubleshooting

Backups and Disaster Recovery

Module Topics

  • Backup Strategy

  • Backups and Virtualization

  • Disaster Recovery

Backup Strategy

  • A solid backup plan can pay off in these instances:

    • When a system malfunctions and files are lost

    • When a user or the system administrator deletes or corrupts a file by accident

    • When a catastrophic disaster occurs

  • Even with disk redundancy, you still need to back up files.

    • RAID provides fault tolerance, but it does not help when a disaster occurs or when a file is corrupted or accidentally removed.

  • Red Hat Enterprise Linux has several low-level backup options that can handle files or disks on a per-system basis, such as dd, tar, cpio, and dump.

    • Use of these tools alone is not considered a solid enterprise backup strategy.

  • Red Hat Enterprise Linux includes two open source enterprise-level backup solutions: Amanda and Bacula.

    • These solutions use a client/server architecture and are considered usable for enterprise backup purposes.

  • There are many other enterprise-class backup solutions that work with Red Hat Enterprise Linux, such as Symantec NetBackup and Acronis.
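
To illustrate the low-level, per-system approach mentioned above, the following Python sketch uses the standard tarfile module to archive a directory and then pull a single file back out. The function names and paths are hypothetical, and this is a minimal sketch under those assumptions, not an enterprise backup strategy.

```python
import tarfile
import tempfile
from pathlib import Path


def create_backup(source_dir: Path, archive_path: Path) -> list[str]:
    """Archive source_dir as gzip-compressed tar and return the member names."""
    with tarfile.open(archive_path, "w:gz") as archive:
        # arcname keeps paths relative to the directory name instead of absolute.
        archive.add(source_dir, arcname=source_dir.name)
    # Re-open the archive to confirm what was actually written.
    with tarfile.open(archive_path, "r:gz") as archive:
        return archive.getnames()


def restore_file(archive_path: Path, member: str, dest_dir: Path) -> Path:
    """Restore one regular file from the archive into dest_dir."""
    with tarfile.open(archive_path, "r:gz") as archive:
        source = archive.extractfile(member)
        if source is None:
            raise ValueError(f"{member} is not a regular file in the archive")
        data = source.read()
    target = dest_dir / Path(member).name
    target.write_bytes(data)
    return target


if __name__ == "__main__":
    # Hypothetical demo data in a temporary directory.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        src = root / "app_config"
        src.mkdir()
        (src / "settings.conf").write_text("mode=production\n")
        print(create_backup(src, root / "backup.tar.gz"))
```

A script like this covers a single system only. It has no central scheduling, catalog, or media tracking, which is why such tools alone are not considered an enterprise strategy.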

Choosing a Backup Strategy

  • Choosing the right solution is up to the individual organization’s needs, budget, and expertise.

  • A good enterprise backup solution for a Red Hat Enterprise Linux environment should:

    • Support Red Hat Enterprise Linux without much effort

    • Use client/server architecture

  • Features that are nice to have:

    • Support for multiple platforms

    • API or command line interface for scripting

    • Ability to take media offline and track that media

Backups and Virtualization

  • Virtual systems are no longer tied to hardware.

    • Baremetal restoration is no longer required.

    • Baremetal systems only need to run as hypervisors.

    • Baremetal systems do not need to store stateful data.

  • A virtual system is essentially the same as a file on disk.

    • There are a myriad of methods for creating full backups of entire systems.

    • Real-time replication of entire systems is now possible.

    • Entire systems can fail over between physical locations without a restore.

Backup Challenges and Virtualization

  • The ability to restore specific files on demand is still required.

  • Data deduplication saves space and I/O by not backing up the same data twice when backing up an entire virtual disk or file system.
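
The idea behind deduplication can be sketched in a few lines of Python. This example assumes fixed-size blocks and SHA-256 content hashes. The block size and function names are illustrative assumptions, and real backup products may use different techniques, such as variable-size chunking.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed fixed block size, chosen for illustration


def dedup_store(data: bytes, store: dict[str, bytes],
                block_size: int = BLOCK_SIZE) -> list[str]:
    """Split data into blocks, store each unique block once, return the recipe."""
    recipe = []
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        digest = hashlib.sha256(block).hexdigest()
        # A block that is already in the store is not written again.
        store.setdefault(digest, block)
        recipe.append(digest)
    return recipe


def restore(recipe: list[str], store: dict[str, bytes]) -> bytes:
    """Rebuild the original data from the ordered list of block digests."""
    return b"".join(store[digest] for digest in recipe)


if __name__ == "__main__":
    # Hypothetical virtual disk: three identical blocks and one different block.
    disk = b"A" * BLOCK_SIZE * 3 + b"B" * BLOCK_SIZE
    store: dict[str, bytes] = {}
    recipe = dedup_store(disk, store)
    print(len(recipe), "blocks referenced,", len(store), "blocks stored")
    assert restore(recipe, store) == disk
```

Backing up the same disk image a second time adds no new blocks to the store, only a new recipe, which is where the space and I/O savings come from.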

Disaster Recovery Overview

Disasters can be classified in two broad categories:

  • Natural disasters such as floods, hurricanes, tornadoes, or earthquakes.

    • You cannot prevent a natural disaster.

    • You can reduce or avoid losses with good mitigation planning.

  • Man-made disasters such as hazardous material spills, infrastructure failure, or bio-terrorism.

    • Surveillance and mitigation planning are invaluable for avoiding or lessening losses.

Disaster Recovery Planning

  • 43% of companies that experience a major loss of business data never reopen.

  • 29% of companies that experience a major data loss close within 2 years.

  • Preparing for continuation or recovery of systems must be taken very seriously.

    • Involves a significant investment of time and money

    • Aims to ensure minimal losses if a disruptive event occurs

  • A disaster recovery plan (DRP) is the overall plan to recover from a disaster.

    • A DRP includes planning for the resumption of applications, data, hardware, electronic communications (such as networking), and other IT infrastructure.

    • Control measures are steps or mechanisms that reduce or eliminate various threats.

Disaster Recovery Control Measures

  • Preventive - Aimed at preventing an event from occurring.

  • Detective - Aimed at detecting or discovering unwanted events.

  • Corrective - Aimed at correcting or restoring the system after a disaster or an event.

Disaster Recovery Objectives

  • Define DR objectives that indicate key metrics for various business processes.

  • Two commonly defined objectives are:

    • Recovery point objective (RPO)

    • Recovery time objective (RTO)

Recovery Point Objective

  • Recovery point objective (RPO) - Maximum tolerable period in which data might be lost from an IT service due to a major incident.

  • RPO gives systems designers a limit to work to.
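
As a worked example of using RPO as a design limit, the Python sketch below checks whether a backup schedule stays within a given RPO. It assumes a simple model in which the worst-case loss is the time between backup starts plus the time a backup takes to complete. The function names and sample values are hypothetical.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval: timedelta,
                         backup_duration: timedelta = timedelta(0)) -> timedelta:
    """Worst case under the assumed model: failure just before a backup completes."""
    return backup_interval + backup_duration


def meets_rpo(backup_interval: timedelta, rpo: timedelta,
              backup_duration: timedelta = timedelta(0)) -> bool:
    """Return True if the worst-case data loss does not exceed the RPO."""
    return worst_case_data_loss(backup_interval, backup_duration) <= rpo


if __name__ == "__main__":
    rpo = timedelta(hours=4)  # hypothetical objective
    # Nightly backups cannot satisfy a 4-hour RPO.
    print(meets_rpo(timedelta(hours=24), rpo))
    # Hourly backups that take 30 minutes can.
    print(meets_rpo(timedelta(hours=1), rpo, timedelta(minutes=30)))
```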

Recovery Time Objective

  • Recovery time objective (RTO) - Duration of time and a service level within which a business process must be restored after a disaster (or disruption) in order to avoid unacceptable consequences associated with a break in business continuity.

  • RTO can include time for trying to fix the problem without a recovery, the recovery itself, testing, and communication to users.

  • RTO is the longest period of time the business can do without the IT service in question.

  • Invoking a recovery plan relies on decision making and the availability of resources to execute the plan.

  • RTO must account for the time required to set the recovery plan in motion.
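
The points above suggest that an RTO estimate is a sum of phases, not just the restore itself. The Python sketch below adds up hypothetical phase durations, including the time needed to decide to invoke the plan, and compares the total with the RTO. The phase names and durations are assumptions for illustration only.

```python
from datetime import timedelta


def estimated_recovery_time(phases: dict[str, timedelta]) -> timedelta:
    """Sum the duration of every phase between the incident and restored service."""
    return sum(phases.values(), timedelta(0))


def within_rto(phases: dict[str, timedelta], rto: timedelta) -> bool:
    """Return True if the estimated end-to-end recovery time fits within the RTO."""
    return estimated_recovery_time(phases) <= rto


if __name__ == "__main__":
    # Hypothetical durations for one business process.
    phases = {
        "decide to invoke the plan": timedelta(minutes=30),
        "attempt fix without recovery": timedelta(hours=1),
        "recovery": timedelta(hours=2),
        "testing": timedelta(minutes=45),
        "communication to users": timedelta(minutes=15),
    }
    print(estimated_recovery_time(phases))
    print(within_rto(phases, timedelta(hours=4)))
```

In this sample the restore alone fits within a 4-hour RTO, but the end-to-end estimate does not, which is the reason to count the decision and testing time.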

Disaster Recovery RTO/RPO Strategy

  • Incomplete RTOs and RPOs can quickly derail a disaster recovery plan.

  • Every item in the DR plan requires a defined recovery point and time objective.

  • Failure to create RTOs and RPOs may lead to significant problems that can extend the disaster’s impact.

  • After RTO and RPO metrics are mapped to IT infrastructure, the DR planner can determine the most suitable recovery strategy for each system.

Disaster Recovery Budget

  • RTO and RPO metrics need to fit within the available budget.

  • Most businesses would like zero data loss and zero time loss, but the cost associated with that level of protection may make the desired solutions impractical.

  • A cost-benefit analysis often dictates which disaster recovery measures are implemented.
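
One common simplification of such a cost-benefit analysis compares the expected yearly loss with and without a measure against the yearly cost of the measure. The Python sketch below assumes that model (expected loss = yearly probability of the event × loss per event). All figures and function names are hypothetical.

```python
def annual_loss_expectancy(annual_probability: float,
                           loss_per_incident: float) -> float:
    """Expected yearly loss under the assumed probability-times-impact model."""
    return annual_probability * loss_per_incident


def net_annual_benefit(ale_before: float, ale_after: float,
                       annual_cost: float) -> float:
    """Loss avoided per year minus what the measure costs per year."""
    return (ale_before - ale_after) - annual_cost


if __name__ == "__main__":
    # Hypothetical figures: a 25% yearly chance of an outage.
    before = annual_loss_expectancy(0.25, 400_000)  # no offsite replication
    after = annual_loss_expectancy(0.25, 40_000)    # with offsite replication
    benefit = net_annual_benefit(before, after, annual_cost=30_000)
    # A positive result suggests the measure pays for itself under this model.
    print(benefit)
```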

Common Disaster Recovery Data Protection Strategies

  • Backups made to tape and sent offsite at regular intervals.

  • Backups made to disk onsite and automatically copied to an offsite disk, or made directly to an offsite disk.

  • Replication of data at the disk frame level to an offsite location over a high-speed WAN, which removes the need to restore the data (only the systems need to be restored or synchronized).

  • Use of high-availability systems which keep both the data and systems replicated offsite.

    • Enables continuous access or brief cutover time to systems and data.

    • Achievable with virtualization for all systems in an enterprise when baremetal systems are used only as hypervisors.

  • Use of an outsourced disaster recovery provider that supplies a stand-by site and systems.

Disaster Recovery: Avoiding Disaster

  • Local mirrors of systems and/or data

  • Disk protection technology, such as RAID

  • Surge protectors

  • Uninterruptible power supply (UPS) and/or backup generator

  • Fire prevention/mitigation systems, such as alarms and fire extinguishers

  • Strong physical security to limit access to hardware

  • Adequate heating and cooling systems that keep working during power outages
